Monthly averages of temperature and humidity of 14 wheather stations in Switzerland are available here. It reaches way back to 1867. Let's see what we can learn about
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import statsmodels.formula.api as sm
import patsy as pt
from statsmodels.graphics.gofplots import qqplot
from datetime import datetime
In [2]:
import zipfile
from urllib.request import urlopen
# Get remote data:
url = 'http://data.geo.admin.ch/ch.meteoschweiz.homogenereihen/VQAA60.csv'
# Read into DataFrame:
data = pd.read_csv( urlopen(url), sep='|',
skiprows=[0,1,2,3],
na_values=['-'],
names=['Station', 'Time', 'H', 'T'],
parse_dates=['Time'],
date_parser=lambda d: pd.to_datetime(d, format='%Y%m') )
# statsmodels does not work so well with nans. Get rid of them:
data.dropna( inplace=True )
In [3]:
data[:6]
Out[3]:
Ok, we have an integer index, station names (abreviated), and the time of measurement. The actual data columns are H (rain fall in millimeters) and T (temperature in $^{o}$C).
Let's create a hierarchical index using Station and Time. Furthermore, I want to know the month (for later fitting of the season), the year, and the time since the start of the measurements:
In [4]:
data['Month'] = pd.DatetimeIndex(data['Time']).month
data['Year'] = pd.DatetimeIndex(data['Time']).year
dt = data['Time'] - data['Time'].min()
data['dt'] = dt / pd.Timedelta(1, 'Y')
# Drop station 'PAY' as all data there is NaN:
data = data[data['Station'] != 'PAY']
# Create 2-level index (Station, Time):
data = data.set_index(pd.MultiIndex.from_arrays(
[data['Station'], data['Time']]),
drop=False)
data[:6][['H', 'T', 'Month', 'Time']]
Out[4]:
The data for 'BAS' (probably Basel) is shown below. As one would expect, the temperature is cyclic. Summer is warmer than winter. The precipitation looks mainly noisy. Taking means for each month (see table), however, hints that there is systematically more precipitation in summer than in winter.
In [5]:
fig, ax1 = pl.subplots()
pl.title('Temperature in BAS')
ax1.set_xlabel('Date')
ax1.set_ylabel('Temperature ($^{o}$ C)', color='r')
ax1.plot_date(data.index.levels[1].astype(datetime), data.loc['BAS']['T'],
r'ro-', label='T')
ax2 = ax1.twinx()
ax2.set_ylabel('Rainfall (mm)', verticalalignment='bottom',
rotation=-90, color='b')
ax2.plot_date(data.index.levels[1].astype(datetime), data.loc['BAS']['H'],
r'bo-', label='H')
ax1.set_xlim(['1867', '1877'])
data.groupby('Month').aggregate(['mean', 'std'])[['H', 'T']]
Out[5]:
Taking means by stations tells the warmer from the colder and the wetter from the drier locatiats:
In [6]:
data.groupby('Station').aggregate( ['mean', 'std'] )[ ['H', 'T'] ]
Out[6]:
In [7]:
# Helpers
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
def _extractLevels(categoricalVars):
out = [name.split('[')[1][:-1] for name in categoricalVars]
return(out)
def extractLevels(categoricalVarNames):
"""
Extracts the level (or levels, in case of interaction terms from a list
of column names (of a design matrix created by patsy).
Example:
input: [ 'C(Month)[T.2]:Station[T.BER]', 'C(Month)[T.3]:Station[T.BER]' ]
output: [ ('2', 'BER'), ('3', 'BER') ]
"""
out = [name.split(':') for name in categoricalVarNames]
out = [_extractLevels(name) for name in out]
return(out)
# Season:Station
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data.dropna(inplace=True)
y, X = pt.dmatrices('T ~ C(Month):Station - 1', data=data)
model = sm.OLS(y, X)
fit = model.fit()
# print(fit.summary())
interactionSlice = X.design_info.term_name_slices['C(Month):Station']
interactionTemp = fit.params[interactionSlice]
interactionNames = X.design_info.column_names[interactionSlice]
interactionNames = extractLevels(interactionNames)
interactionNames = list(zip(*interactionNames))
data['interaction'] = data['Month'].astype('str') + ':' + data['Station']
pl.figure()
pl.title('Interaction Temperature-Deviation from January in BAS')
pl.xlabel('Month (Nr.)')
pl.ylabel('Temperature Deviation ($^{o}$ C)')
effects = fit.params.reshape(-1, 12)
pl.plot(effects.T, r'o-')
pl.legend(interactionNames[1][::12]);
Having fitted the obvious dependencies, we now - hopefully - have a clearer look at the non-obvious stuff. Let's diagnose the validity of the regression, first (see plots below):
In [8]:
data['fit'] = fit.predict(exog=X)
data['resid'] = data['T'] - data['fit']
fig, ax = pl.subplots()
qqplot(data['resid'], ax=ax)
pl.figure()
pl.title('Residuals vs Month');
pl.plot(data['Month'], data['resid'], r'o')
pl.xlabel('Month (1)');
pl.ylabel('Residual ($^o$C)');
pl.figure()
pl.title('Residuals vs Station');
residStation = data.pivot(index='Time', columns='Station', values='resid')
pl.boxplot(np.asarray(residStation))
pl.xticks(range(1, len(residStation.columns)+1), residStation.columns)
pl.xlabel('Station');
pl.ylabel('Residual ($^o$C)');
pl.figure()
pl.plot_date(data['Time'].astype(datetime), data['resid'], r'o')
pl.xlabel('Time (years)');
pl.ylabel('Residual ($^o$C)');
pl.figure()
pl.plot(data['fit'], data['resid'], r'o')
pl.xlabel('Fitted Values ($^{o}$ C)');
pl.ylabel('Residual ($^o$C)');
In [9]:
modelClimateChange = sm.ols('T ~ C(Month):Station + dt - 1', data=data)
fitCC = modelClimateChange.fit()
#tuple(fitCC.conf_int().loc['dt'])
'Temperature Drift: ({0:.2f} +/- {1:.2f}) K/century (95% Conf.)'.format(
fitCC.params['dt']*1e2, fitCC.bse['dt']*1e2)
Out[9]: